ABSTRACT

Many carcinogens leave a unique mutational 
fingerprint in the human genome. These mutational 
fingerprints manifest as specific types of mutations 
often clustering at certain genomic loci in tumor 
genomes from carcinogen-exposed individuals. To 
develop a high-throughput method for detecting the 
mutational fingerprint of carcinogens, we have 
devised a cost-, time- and labor-effective strategy, 
in which the widely used transgenic Big Blue® 
mouse mutation detection assay is made compat- 
ible with the Roche/454 Genome Sequencer FLX 
Titanium next-generation sequencing technology. 
As proof of principle, we have used this novel 
method to establish the mutational fingerprints of 
three prominent carcinogens with varying muta- 
genic potencies, including sunlight ultraviolet radi- 
ation, 4-aminobiphenyl and secondhand smoke that 
are known to be strong, moderate and weak 
mutagens, respectively. For verification purposes, 
we have compared the mutational fingerprints of 
these carcinogens obtained by our newly developed 
method with those obtained by parallel analyses 
using the conventional low-throughput approach, 
that is, standard mutation detection assay 
followed by direct DNA sequencing using a capillary 
DNA sequencer. We demonstrate that this
high-throughput next-generation sequencing-based
method detects the mutational fingerprints of the
tested carcinogens with high specificity and sensitivity.
The method is reproducible, and its accuracy is
comparable with that of the currently available
low-throughput method. In conclusion, this novel
method has the potential to move the field of
carcinogenesis forward by allowing high-throughput
analysis of mutations induced by endogenous and/or
exogenous genotoxic agents.



INTRODUCTION 

The human cancer genome is shaped by assaults from 
endogenous and exogenous mutagens (1). Many carcino- 
gens are mutagens or turn into mutagenic derivatives 
through biotransformation. Of these, some are known to 
leave a unique mutational fingerprint in the human 
genome (2,3). These mutational fingerprints manifest as 
specific types of mutations (e.g. induced base substitu- 
tion/deletion/insertion), often clustering at certain nucleo- 
tide positions in cancer-related loci, in tumor genomes 
from carcinogen-exposed individuals (4,5). Establishing 
the mutational fingerprint of carcinogens is important 
because (i) from a mechanistic point of view, it can help 
infer human cancer etiology; and (ii) from a standpoint of 
public health, it can help reinforce hazard removal/reduc- 
tion strategies for perilous environmental agents. Until 
recently, the mutational fingerprint of carcinogens has 
only been investigated in a few cancer-related genes or 
housekeeping genes (4). With the advent of next- 
generation sequencing technologies, however, a compre- 
hensive mutational fingerprint of carcinogens can now 
be determined on a genome-wide scale (6,7). These
breakthrough technologies are poised to survey the landscape
of the human cancer genome and reveal mutational fingerprints
that may be ascribed to environmental carcinogens (8).
However, to verify causality, the identified mutational fin- 
gerprints need to be experimentally recapitulated in 
validated model systems and under strictly controlled 
exposure conditions (4,9). 

Transgenic rodents are extensively validated model
systems for establishing the mutational fingerprint of
carcinogens (10). However, the mutation detection assays
incorporated into these transgenic systems are only
amenable to conventional DNA sequencing analysis (4).
This feature is prohibitive because it only allows
'low-throughput' detection of mutational fingerprints using
direct DNA sequencing of phenotypically expressed
individual mutants. Such an approach is relatively costly,
extensively time consuming and extremely laborious (4).
To develop a 'high-throughput' method for detecting the 
mutational fingerprint of carcinogens, we have devised a 
cost-, time- and labor-effective strategy, in which the 
widely used transgenic Big Blue® mouse mutation detec- 
tion assay (Stratagene, La Jolla, CA) is made compatible 
with the Roche/454 Genome Sequencer FLX Titanium 
next-generation sequencing technology (454 Life 
Sciences, Branford, CT). As proof of principle, we have 
used this novel method to establish the mutational finger- 
prints of three carcinogens with varying mutagenic 
potencies, including sunlight ultraviolet radiation, 
4-aminobiphenyl (4-ABP) and secondhand smoke 
(SHS) that are known to be strong, moderate and weak 
mutagens, respectively (11-13). For verification purposes, 
we have compared the mutational fingerprints of these 
carcinogens obtained by our newly developed method 
with those obtained by parallel analyses using the conven- 
tional low-throughput approach, that is, standard 
mutation detection assay followed by direct DNA 
sequencing using the capillary ABI-3730 DNA Analyzer 
(ABI Prism, PE Applied BioSystems, Foster City, CA). 
We have also performed similar analyses to establish the 
spontaneous mutation spectra in control (sham-treated) 
samples using both the new next-generation sequencing- 
based method and the conventional DNA sequencing. 

MATERIALS AND METHODS 

Selection of carcinogens and experimental treatments 

To demonstrate the sensitivity and specificity of our 
method for detecting the mutational fingerprint of car- 
cinogens, we chose three distinct agents with high, mod- 
erate and low mutagenic potencies, respectively. As an 
extensively studied environmental physical carcinogen, 
the ultraviolet B (UVB) fraction of sunlight (λ: 280-320 nm)
is implicated in the etiology of human skin
cancer and proven to be a highly potent mutagen
(14-16). The aromatic amine 4-ABP is a widespread en- 
vironmental contaminant, which is present in various oc- 
cupational settings and tobacco smoke and considered an 
etiologic agent in human bladder cancer (17,18). 4-ABP is 
known to be moderately mutagenic (13). SHS is an envir- 
onmental pollutant, which is etiologically implicated in 
human lung cancer, and possesses relatively weak muta- 
genic potency (12,19). 

Detailed information on the experimental treatment of
Big Blue® mouse embryonic fibroblasts with UVB and the
chronic exposure of Big Blue® mice to 4-ABP or SHS is
provided in our previously published reports (11,13,20).
Briefly, the UVB irradiation of Big Blue® mouse embryonic
fibroblasts was performed in vitro at a single biologically
relevant dose of 75.6 mJ/cm², and under physiologic
conditions (11). In vivo, 4-ABP (Sigma-Aldrich Inc., Saint
Louis, MO) was administered intraperitoneally to male
adult Big Blue® mice on a weekly basis for a duration of
6 weeks at increasing doses of 25-100 mg/kg body weight (13).
The SHS treatment of male adult Big Blue® mice was
performed in vivo in exposure chambers of a TE-10
smoking machine (Teague Enterprises, Davis, CA) for
5 h/day, 5 days/week for a duration of 4 months (20).
Subsequent to all experimental treatments, genomic
DNA was isolated using a standard phenol extraction-based
protocol (21). The DNA was dissolved in TE
buffer (10 mM Tris-HCl, 1 mM EDTA, pH 7.5) and
kept at −80°C until further analysis.

Modification of the Big Blue® mouse mutation detection 
assay for compatibility with a next-generation sequencing 
platform 

The transgenic Big Blue® rodent system is an extensively
validated model for studying spontaneous or experimentally
induced mutagenesis (4). The genome of these
transgenic animals contains multiple copies of a chromosomally
integrated λLIZ shuttle vector, which carries two
bacterial reporter genes, cII and lacI (10). To
investigate the experimental induction of mutagenesis,
transgenic rodents or cell cultures derived from their
organs of interest are treated with a test agent in vivo or
in vitro, respectively. Following a latency period needed
for the expression of mutations, genomic DNA is isolated,
and the λLIZ shuttle vectors are recovered. The recovered
vectors are then used in a bacterial phenotypic expression
assay to identify mutants, that is, cells harboring mutations
in the reporter gene(s) (10). To find the type and distribution
of induced mutations in the cII or lacI genes, which reflect
the mutational fingerprint of the tested agent, each
phenotypically expressed mutant needs to be isolated individually
and subjected to direct DNA sequencing. This is a time-,
cost- and labor-intensive process, and as such, precludes
'high-throughput' generation of mutational fingerprints
(4). To address this issue, we have devised a novel
strategy, in which a pool of phenotypically expressed
mutants, in lieu of single mutants, can be sequenced using
a next-generation sequencing platform.

As the preparatory step, we performed the cII mutagenesis
assay on the genomic DNA of carcinogen-treated
cells/mice and control (sham-treated) samples to phenotypically
express the induced and spontaneously derived cII
mutants, respectively. The assay was performed using
the commercially available Transpack Packaging Extract
kit (Stratagene) according to the instructions of the
manufacturer. Following the expression assay, 150 cII mutant
plaques obtained from the analysis of genomic DNA from
each experimental or control group were cored
individually and placed in a microtube containing
500 µl double-distilled water. We note that the pool of
150 mutants per sample is comparable with the number
of mutants sequenced individually by the conventional
low-throughput method for establishing the mutational
fingerprint of carcinogens (4). For verification of
reproducibility, two or more independent pools of 150 mutants
were prepared from each experimental or control group
simultaneously. The microtubes containing pools of 150
mutant cII plaques were boiled for 5 min and
subsequently centrifuged at 18 000 g for 5 min. Ten
microliters of the supernatant were immediately transferred to a
new microtube containing 40 µl of a polymerase chain
reaction (PCR) mastermix in which the final concentrations
of the reagents were 1× PCR buffer, 1× Q
solution, 200 nM each of the forward and reverse
primers, 50 µM dNTP and 2.5 U Taq DNA Polymerase
(Qiagen, Valencia, CA). The oligonucleotide primers
were custom designed to contain the forward and
reverse cII sequences (needed for amplification of the
entire cII gene and its flanking regions) together with
tagged linker sequences (required for downstream
application of the Genome Sequencer FLX Titanium
next-generation sequencing). The forward and reverse
primers were
5'-cgtatcgcctccctcgcgccatcagccgctcttacacattccagc (Tm: 72.6°C) and
5'-ctatgcgccttgccagcccgctcagcctctgccgaagttgagtat (Tm: 72.8°C),
respectively. The
thermocycling conditions were as follows: denaturation
at 95°C for 3 min, 10 cycles of amplification consisting of
45 s at 95°C, 1 min at 60°C and 1 min at 72°C, 25 cycles of
re-amplification consisting of 45 s at 95°C and 2 min
at 73°C and, finally, 7 min of extension at 73°C. The
PCR-amplified product (526 bp) was purified using the
QIAquick PCR purification kit (Qiagen) and kept at
−80°C until further analysis.
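The fusion-primer layout described above (a linker followed by a cII-specific sequence) can be illustrated with a short Python sketch. The text does not state where the linker ends, so the sketch assumes that the first 25 nucleotides of each primer are the linker and the remainder is the cII-specific part; that split position and the helper name `split_fusion_primer` are hypothetical.

```python
# Primer sequences as given in the text (5' to 3').
FORWARD = "cgtatcgcctccctcgcgccatcagccgctcttacacattccagc"
REVERSE = "ctatgcgccttgccagcccgctcagcctctgccgaagttgagtat"

# Assumption (not stated in the text): the linker is the first 25 nt.
LINKER_LENGTH = 25


def split_fusion_primer(primer, linker_length=LINKER_LENGTH):
    """Return the (linker, target_specific) parts of a fusion primer."""
    return primer[:linker_length], primer[linker_length:]


if __name__ == "__main__":
    for name, primer in (("forward", FORWARD), ("reverse", REVERSE)):
        linker, target = split_fusion_primer(primer)
        print(f"{name}: total {len(primer)} nt, "
              f"linker {linker} ({len(linker)} nt), "
              f"cII-specific {target} ({len(target)} nt)")
```

Under this assumption each 45-nt primer separates into a 25-nt linker and a 20-nt gene-specific portion.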

Genome Sequencer FLX Titanium next-generation
sequencing and bioinformatics data processing and 
analysis 

Ultradeep pyrosequencing was performed using a 454 GS
FLX (454 Life Sciences). The amplified PCR products
encompassing the entire cII gene and its flanking regions
plus the 454 (A) and (B) linkers (454 Life Sciences) were
further purified by the MinElute PCR purification kit
(Qiagen). The resulting product was clonally amplified on
capture beads in water-in-oil emulsion microreactors
(454 Life Sciences). The enriched-DNA beads were
deposited onto the wells of a full Roche 454 FLX
Titanium PicoTiter Plate device and pyrosequenced in
both forward and reverse directions. The 200-nucleotide
cycles were carried out in a 10-h sequencing run,
according to the manufacturer's instructions (454 Life Sciences).
Sequence reads were generated in FASTQ format using
the Data Processing Pipeline (v2.3) of the GS FLX System
software (454 Life Sciences). The sequence reads were
filtered by read length (within the 350-550 bp limit). The
filtered sequence reads were aligned to the reference
sequence using the long read alignment tool of CLC bio
Genomics Workbench (v4.5). The variations of each sample
were detected using the SNP/DIP analysis tools of CLC bio
Genomics Workbench (v4.5), which are based on the
Neighborhood Quality Standard algorithm (22). Based
on the total number of mutants per sample, minimum
variation frequency and read number were set as
thresholds to detect base substitutions and insertions and
deletions (indels). Because each sample contained a pool of 150
mutants, we used 0.66% (1 of 150) as the minimum threshold
for detecting base substitutions or indels. The minimum
variation frequency was also used as a benchmark to
calculate the total number of each specific type of mutation in
each sample. For example, a 6.6% frequency of C→T base
substitution in an individual sample was counted as 10 mutants
that have this type of mutation in that sample. The
variation distribution was calculated based on the total
number of mutated amplicons in each sample. The
variation spectrum of each sample was plotted on the
reference sequence with a heatmap to improve the visualization
of variations. To minimize homopolymer sequencing
errors, which may cause assembly errors, false variations
(23) or reduced quality scores of the reads (24), we
implemented a filtration step that uses the 'high-quality' read
coverage threshold and the variation status. The filter
initially scans the reference sequence and identifies
homopolymer regions (length: greater than or equal to 3).
Subsequently, it counts the coverage of 'high-quality'
reads (Phred quality: greater than or equal to
30) relative to the variation in the homopolymer region and
filters out the variations with low 'high-quality' read
coverage. The filter then compares the variation within
the homopolymer and at its neighboring nucleotide. If
the variation within the homopolymer is the same as
that at its adjacent nucleotide, the filter drops this
variation and considers it to have arisen from a homopolymer
sequencing error and/or assembly error. Supplementary
Figure S1 shows an example of this type of false variation
caused by a homopolymer error. In this example, there are
two variations, A→C and C→A. The A→C
variation within the homopolymer region is followed by
the C→A variation at its neighboring base.
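Two of the steps described above lend themselves to a small worked example: converting a variation frequency into a number of mutants for a pool of 150, and locating homopolymer regions of length 3 or more in a reference sequence. The following Python sketch is illustrative only; the function names and the toy reference string are hypothetical, and the actual filtering in the study was carried out with the software named in the text.

```python
def min_variation_frequency(pool_size):
    """Frequency (%) expected if exactly one mutant in the pool carries a variant."""
    return 100.0 / pool_size


def mutant_count(variation_frequency_pct, pool_size):
    """Convert a variation frequency (%) into a whole number of mutants."""
    return round(variation_frequency_pct / 100.0 * pool_size)


def homopolymer_runs(reference, min_length=3):
    """Return (start, end, base) for each run of identical bases.

    Positions are 0-based and `end` is exclusive.
    """
    runs = []
    start = 0
    for i in range(1, len(reference) + 1):
        # Close the current run at the end of the sequence or on a base change.
        if i == len(reference) or reference[i] != reference[start]:
            if i - start >= min_length:
                runs.append((start, i, reference[start]))
            start = i
    return runs


if __name__ == "__main__":
    pool = 150
    print(f"threshold: {min_variation_frequency(pool):.2f}%")
    print("mutants at 6.6%:", mutant_count(6.6, pool))
    # Toy sequence for illustration, not the cII reference.
    print(homopolymer_runs("ACGGGTTAAAA"))
```

With a pool of 150, one mutant corresponds to roughly 0.67% (reported as 0.66% in the text), and a 6.6% variation rounds to 10 mutants.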

Statistical and bioinformatics analyses 

The results are expressed as mean ± 95% confidence
interval. Comparison of mutant frequency data between an
experimental group and its corresponding control group
was made using the Wilcoxon rank-sum test. To determine
the reproducibility of mutation spectra established in
duplicate/multiplicate samples analyzed by the
next-generation sequencing-based method, we performed both
hierarchical clustering analysis and principal
component analysis (PCA) using the Partek Genomics Suite
v6.11.1116 (http://www.partek.com). To further analyze
the comparability of mutation spectra established in
duplicate/multiplicate samples analyzed by the
next-generation sequencing-based method, we performed
correlation analysis to calculate the similarities between the
frequency and position of each mutation detected in the
respective samples. In addition, we used correlation
analysis to compare the mutation spectra of each set of
two matching samples established by the next-generation
sequencing-based method and conventional DNA
sequencing, respectively. We note that this correlation
analysis takes into account the similarities between the
frequency and position of each mutation detected by
the respective methods in two counterpart samples. The
applied correlation analysis uses stringent criteria to
compare two mutation spectra with respect to both the
frequency and type of each specific mutation that occurred in
the entire length of the cII gene. This comparative analysis
takes into consideration the frequency and type of
mutations in the cII gene as a whole, not only at certain
nucleotide positions. Thus, the overall mutation
frequency and pattern across the full length of the cII
sequence are compared when the above correlation
analysis is performed on two mutation spectra. All
statistical tests were two sided. Values of P < 0.05 were
considered statistically significant. The S-Plus 7.0 for
Windows software (Insightful Corp., Seattle, WA) was
used for statistical analysis.
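As a minimal illustration of comparing two mutation spectra by correlation, the Python sketch below represents each spectrum as a vector of frequencies (one entry per position and mutation type) and computes the Pearson correlation coefficient r together with a dissimilarity. The dissimilarity is assumed here to be 1 − r, a common definition; the text does not give the exact formula used by the statistical software, and the function names and example vectors are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def pearson_dissimilarity(x, y):
    """Assumed definition: 1 - r (0 = identical pattern, 2 = opposite)."""
    return 1.0 - pearson_r(x, y)


if __name__ == "__main__":
    # Hypothetical spectra: frequency (%) of each mutation at each site.
    spectrum_a = [0.0, 6.6, 1.3, 0.0, 12.0]
    spectrum_b = [0.0, 7.0, 1.0, 0.7, 11.5]
    print(round(pearson_r(spectrum_a, spectrum_b), 3))
    print(round(pearson_dissimilarity(spectrum_a, spectrum_b), 3))
```

Because both the position and the frequency of every mutation enter the vectors, two spectra correlate strongly only when the same sites are mutated at similar frequencies.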

RESULTS 

Mutant frequency and mutation spectrum 

As a potent mutagen implicated in skin carcinogenesis
(11), sunlight UVB caused significant mutagenicity
in Big Blue® mouse embryonic fibroblasts irradiated
in vitro with this environmental carcinogen. The
strong mutagenicity of UVB was demonstrated by a
92.6-fold increase in background cII mutant frequency
from 3.01 ± 0.68 × 10⁻⁵ in control (non-irradiated) cells
to 278.92 ± 20.63 × 10⁻⁵ in UVB-irradiated cells
(P = 0.0002). In vivo treatment of Big Blue® mice with
4-ABP, a known bladder carcinogen with moderate
mutagenic potency (13), resulted in a 9.9-fold increase
in background cII mutant frequency from
2.09 ± 0.20 × 10⁻⁵ in bladder DNA of control
(solvent-treated) mice to 20.62 ± 4.77 × 10⁻⁵ in bladder DNA of
4-ABP-treated mice (P = 0.0079). In vivo exposure of Big
Blue® mice to SHS, a known pulmonary carcinogen with
comparatively weak mutagenic potency (12), resulted in a
2.1-fold increase in background cII mutant frequency
from 2.00 ± 0.29 × 10⁻⁵ in lung DNA of control
(clean-air-treated) mice to 4.09 ± 0.79 × 10⁻⁵ in lung
DNA of SHS-treated mice (P = 0.0011).

To determine what specific type(s) of mutation
caused the significant increase in cII mutant frequency
in carcinogen-treated cells/mice relative to control, we
computed the absolute mutant frequency of each type of
mutation in the cII gene (i.e. transitions, transversions,
deletions and insertions) in the genome of
carcinogen-treated cells/mice and control. As shown in Figure 1A,
the absolute mutant frequencies of G:C→C:G
transversions, G:C→T:A transversions, G:C→A:T
transitions, A:T→T:A transversions, A:T→G:C transitions,
A:T→C:G transversions and insertions/deletions were all
increased, although to different extents, in the cII gene in
genomic DNA of UVB-irradiated cells relative to control.
The percentage contributions of the respective types of
mutation to the overall increase in cII mutant frequency
in UVB-irradiated cells were 1.0, 0.2, 87.0, 3.1, 0.6, 1.4 and
6.7, respectively (Figure 1B). More specifically, mutations
occurring at dipyrimidine sites account for nearly all the induced
cII mutations in UVB-irradiated cells. Of these,
G:C→A:T transition mutations, which comprise the
majority of all the induced cII mutations (87.0%), are the
main contributor to the overall increase in cII mutant
frequency in UVB-irradiated cells (Figure 1A and B and
Supplementary Table S1).
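The percentage contributions quoted here follow the 'induced mutation (%)' definition given in the figure legends. Written out for a mutation type i, with MF denoting mutant frequency (the symbols are notation introduced for this illustration, not the original's):

```latex
\[
\text{induced mutation}_i\,(\%) =
\frac{\left(MF_{i}^{\text{treated}} - MF_{i}^{\text{control}}\right) \times 100}
     {MF_{\text{overall}}^{\text{treated}} - MF_{\text{overall}}^{\text{control}}}
\]
```

When the overall mutant frequency is the sum of the type-specific frequencies, these percentages add up to 100 across all mutation types, as the seven UVB values listed above do (1.0 + 0.2 + 87.0 + 3.1 + 0.6 + 1.4 + 6.7 = 100.0).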

As illustrated in Figure 2A and B, the percentage
contributions of G:C→C:G transversions, G:C→T:A
transversions, G:C→A:T transitions, A:T→T:A
transversions, A:T→G:C transitions, A:T→C:G
transversions and insertions/deletions to the overall
increase in cII mutant frequency in bladder DNA of
4-ABP-treated mice were 15.3, 40.0, 20.4, 5.7, 8.7, 1.8 and
8.1, respectively. Specifically, mutations occurring at G:C
basepairs account for 81.2% of all the induced cII
mutations in bladder DNA of 4-ABP-treated mice. Of these,
G:C→T:A transversion mutations, which constitute 40%
of all the induced cII mutations, dominate the overall
increase in cII mutant frequency in bladder DNA of
4-ABP-treated mice (Figure 2A and B and Supplementary
Table S1). As shown in Figure 3A and B, the percentage
contributions of G:C→C:G transversions, G:C→T:A
transversions, G:C→A:T transitions, A:T→T:A
transversions, A:T→G:C transitions and A:T→C:G
transversions to the overall increase in cII mutant frequency
in lung DNA of SHS-exposed mice were 6.3, 15.7, 49.2, 7.4,
15.9 and 10.1, respectively. Thus, G:C→A:T transition
mutations account for nearly half of all the induced cII
mutations in the lung DNA of SHS-exposed mice (Figure 3A and
B and Supplementary Table S1).
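The six base-substitution classes used in these results describe each change as a base pair, so a substitution read on either strand falls into the same class (for example, C→T and G→A are both G:C→A:T transitions). A minimal Python sketch of that mapping follows; the function name is hypothetical and the code illustrates the naming convention only, not the software used in the study.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}


def classify_substitution(ref, alt):
    """Map a single-base substitution to one of the six base-pair classes."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # Express the change on the strand where the original base is a purine,
    # so every class starts with G:C or A:T.
    if ref not in PURINES:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    # Purine to purine is a transition; purine to pyrimidine is a transversion.
    kind = "transition" if alt in PURINES else "transversion"
    return f"{ref}:{COMPLEMENT[ref]}->{alt}:{COMPLEMENT[alt]} {kind}"


if __name__ == "__main__":
    for ref, alt in (("C", "T"), ("G", "T"), ("G", "C"),
                     ("T", "C"), ("A", "T"), ("A", "C")):
        print(f"{ref}->{alt}: {classify_substitution(ref, alt)}")
```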

We then mapped the locations of induced mutations in
the cII gene in the genome of carcinogen-treated cells/mice
by plotting the induced mutations versus control
(spontaneously derived) mutations along the reference
cII sequence. As shown in Figure 1C and D, the
UVB-induced mutations occurred at specific nucleotide
positions in the cII gene, which were distinct from those loci
at which spontaneous mutations occurred in control.
These UVB-specific mutations clustered at several
nucleotide positions, predominantly within dipyrimidine
sequence contexts, and were almost exclusively G:C→A:T
transitions (Figure 1C and Supplementary Figure S2). The
overall spectrum of induced mutations in the cII gene of
UVB-irradiated cells is comparable with that previously
found in the same model system using the conventional
low-throughput method (P < 0.0001, Figure 1A-C,
Supplementary Figure S2 and Supplementary Table S1).

Mapping of the induced cII mutations in the genome of
4-ABP-treated mice showed that the majority of
mutations were located at G:C basepairs (Supplementary
Figure S3). Of these, G:C→T:A transversions clustering
at several codon positions in the cII gene were specific for
4-ABP treatment (Supplementary Figure S3). This
spectrum of induced mutations in the cII gene of
4-ABP-treated mice is also comparable with that
previously found in the same model system using the
conventional low-throughput method (P = 0.004, Figure 2A-C,
Supplementary Figure S3 and Supplementary Table S1).

Furthermore, mapping of the induced cII mutations
in the genome of SHS-treated mice revealed that most
mutations were localized to G:C basepairs (74.9%,
Supplementary Figure S4). There were subtle differences
between the locations of SHS-induced mutations and the
spontaneously derived mutations in the cII gene in lung
DNA from SHS-treated mice and control, respectively. In
addition, the frequencies of mutation at certain loci along
the cII gene in SHS-treated mice were slightly different
from those in control. These subtle differences in the
type and location of SHS-induced and control cII
mutations concur with the weak mutagenicity of SHS and
are consistent with the results found previously in the
same model system using the conventional low-throughput
method (P < 0.0001, Figure 3A-C, Supplementary
Figure S4 and Supplementary Table S1).

Figure 1. cII mutant frequency and mutation spectrum in UVB-irradiated cells versus control. Mutation analysis of the cII gene in mouse embryonic
fibroblasts irradiated with UVB or control was performed using the cII mutagenesis assay, as described in 'Materials and Methods'. (A) Absolute
mutant frequency of each specific type of mutation in the cII gene of UVB-irradiated cells or control as determined by both the new method (NGS)
and the conventional method (ABI). Average results (bars) from multiple analyses plus 95% confidence interval (error bars) are shown. (B)
Percentage increase in frequency of each specific type of mutation in the cII gene of UVB-irradiated cells or control as determined by both the
new method (NGS) and the conventional method (ABI). Results are expressed as 'induced mutation (%)', which is calculated as [(mutant frequency
of each type of mutation in UVB group − mutant frequency of the respective type of mutation in control group) × 100]/(overall induced mutant
frequency in UVB group − overall spontaneous mutant frequency in control group). Distribution of mutations in the cII gene of UVB-irradiated
cells (C) and control (D) as determined by the new method (NGS) and/or the conventional method (ABI).

Figure 2. cII mutant frequency and mutation spectrum in 4-ABP-treated mice versus control. Mutation analysis of the cII gene in bladder DNA of
mice treated with 4-ABP or control was performed using the cII mutagenesis assay, as described in 'Materials and Methods'. (A) Absolute mutant
frequency of each specific type of mutation in the cII gene of 4-ABP-treated mice or control as determined by both the new method (NGS) and the
conventional method (ABI). (B) Percentage increase in frequency of each specific type of mutation in the cII gene of 4-ABP-treated mice or control
as determined by both the new method (NGS) and the conventional method (ABI). Results are expressed as 'induced mutation (%)', which is
calculated as [(mutant frequency of each type of mutation in 4-ABP group − mutant frequency of the respective type of mutation in control
group) × 100]/(overall induced mutant frequency in 4-ABP group − overall spontaneous mutant frequency in control group). Distribution of
mutations in the cII gene of 4-ABP-treated mice (C) and control (D) as determined by the new method (NGS) and/or the conventional method (ABI).

Figure 3. cII mutant frequency and mutation spectrum in SHS-treated mice versus control. Mutation analysis of the cII gene in lung DNA of mice
exposed to SHS or control was performed using the cII mutagenesis assay, as described in 'Materials and Methods'. (A) Absolute mutant frequency
of each specific type of mutation in the cII gene of SHS-exposed mice or control as determined by both the new method (NGS) and the conventional
method (ABI). (B) Percentage increase in frequency of each specific type of mutation in the cII gene of SHS-exposed mice or control as determined
by both the new method (NGS) and the conventional method (ABI). Results are expressed as 'induced mutation (%)', which is calculated as [(mutant
frequency of each type of mutation in SHS group − mutant frequency of the respective type of mutation in control group) × 100]/(overall induced
mutant frequency in SHS group − overall spontaneous mutant frequency in control group). Distribution of mutations in the cII gene of
SHS-exposed mice (C) and control (D) as determined by the new method (NGS) and/or the conventional method (ABI).

Figure 4. Spontaneous cII mutation spectrum in control (sham-treated) mice. Mutation analysis of the cII gene in lung DNA of control mice (clean
air exposed) was performed using the cII mutagenesis assay, as described in 'Materials and Methods'. (A) Absolute mutant frequency of each specific
type of mutation in the cII gene of control mice as determined by both the new method (NGS) and the conventional method (ABI). (B) Percentage of
each specific type of mutation in the cII gene of control mice as determined by both the new method (NGS) and the conventional method (ABI).
(C) Distribution of mutations in the cII gene of control mice as determined by both the new method (NGS) and the conventional method (ABI).

Finally, we established the spontaneous mutation
spectrum in control (sham-treated) samples using our
next-generation sequencing-based method. As shown in
Figure 4A and B and Supplementary Table S1, the
percentages of G:C→C:G transversions, G:C→T:A
transversions, G:C→A:T transitions, A:T→T:A
transversions, A:T→G:C transitions, A:T→C:G
transversions and insertions/deletions in the cII gene of control
genomic DNA were 1.4, 3.8, 72.9, 2.5, 4.0, 5.5 and 9.9,
respectively. Of these, the majority occurred at 5'-CpG
dinucleotides, with G:C→A:T transitions being the
predominant type of mutation (i.e. over 90% of all
mutations occurring at CpG-containing sequences
were G:C→A:T transitions) (Supplementary Figure S5).
This spectrum of spontaneous cII mutations in the
genomic DNA of control (sham-treated) cells is comparable
with that previously found in the same model system using
the conventional low-throughput method (P < 0.0001,
Figure 4A and B, Supplementary Figure S5 and
Supplementary Table S1). Altogether, the data indicate
that our new method can sensitively and specifically detect
the mutational fingerprint of three prominent carcinogens
with varying mutagenic potencies, including UVB, 4-ABP
and SHS, as well as establish the spectrum of spontaneous
mutations in control. The levels of sensitivity and specificity
of our new method are comparable with those of the
currently available low-throughput method.

Hierarchical clustering analysis and
principal component analysis

We determined the reproducibility of the results obtained
by our new method by assaying duplicate/multiplicate
samples in a single run (to examine intra-assay variation)
and/or in two different runs (to examine inter-assay
variation). We prepared two independent pools of 150
cII mutants from UVB-irradiated cells and analyzed
them in a single assay run. In addition, we used another
three independent pools of 150 mutants from the
UVB-irradiated cells and analyzed them in a subsequent
assay run. In all cases, we verified reproducible results for
the duplicate/triplicate samples analyzed in a single run
as well as in two independent runs (Figure 5). More
specifically, the heatmap generated by the Hierarchical
Clustering Analysis, which uses Pearson's
dissimilarity to measure differences between the frequency and
position of each mutation that occurred among different
samples, showed that all the UVB-irradiated samples
clustered very closely together (Figure 5A). This observation
was further confirmed by the PCA mapping, which showed
that all the UVB-irradiated samples grouped together and
remained distant from other differently treated or control
samples (Figure 5B).

Furthermore, we analyzed duplicate 4-ABP-treated
samples in a single assay run as well as in a subsequent
run. As shown in Figure 5, comparable results were
obtained from the analysis of the above-specified
samples. The heatmap generated by the Hierarchical
Clustering Analysis (Figure 5A) and the PCA mapping
(Figure 5B) revealed that all the 4-ABP-treated samples
clustered closely together and stayed separated from other
differently treated or control samples. Moreover, we
examined duplicate SHS-treated samples in a single
assay run and in a subsequent run. As shown in
Figure 5, comparable results were obtained from the
analysis of the above SHS-treated samples, as reflected
by the clustering of all SHS-treated samples together
(Figure 5A), as well as mapping of these samples closely
to each other, while being apart from other differently
treated samples (Figure 5B). Given the weak mutagenicity
of SHS, it is also of note that the SHS-treated samples
mapped relatively close to the control samples (Figure 5B).

We have also analyzed duplicate control samples in a 
single assay run as well as in a subsequent run. As shown 
in Figure 5, comparable results were obtained from the 
analysis of the above control samples. The heatmap 
generated by the Hierarchical Clustering Analysis (Figure 
5A) and the PCA mapping (Figure 5B) showed that all the 
control samples clustered together and stayed distant 
from other differently treated samples. These data validate 
the reproducibility of our new high-throughput next- 
generation sequencing-based method for detecting both 
the carcinogen-induced and control mutation spectra. 
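The clustering criterion described above can be sketched in a few lines. This is a minimal illustration, assuming that each sample is summarized as a vector of mutation frequencies per position and that Pearson's dissimilarity is taken as 1 - r; the sample names and frequency values are hypothetical and are not data from this study.

```python
from math import sqrt

def pearson_dissimilarity(x, y):
    """Return 1 - r, where r is the Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("constant profile: correlation is undefined")
    return 1.0 - cov / (sx * sy)

def nearest_neighbor(profiles):
    """Map each sample name to the name of its least dissimilar other sample."""
    result = {}
    for name, vec in profiles.items():
        candidates = [
            (pearson_dissimilarity(vec, other_vec), other)
            for other, other_vec in profiles.items()
            if other != name
        ]
        result[name] = min(candidates)[1]
    return result

# Hypothetical mutation-frequency profiles (fraction of reads mutated at four positions).
profiles = {
    "UVB_run1": [0.30, 0.02, 0.25, 0.01],
    "UVB_run2": [0.28, 0.03, 0.27, 0.02],
    "ABP_run1": [0.02, 0.35, 0.01, 0.20],
    "ABP_run2": [0.03, 0.33, 0.02, 0.22],
}
neighbors = nearest_neighbor(profiles)
```

With these illustrative profiles, each replicate's nearest neighbor is the other replicate of the same treatment, which is the pattern a reproducible assay is expected to show in the heatmap.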

Read coverage analysis 

To demonstrate the sensitivity and specificity of our
next-generation sequencing-based method and its efficient
read coverage for the detection of experimentally induced/
control mutations, we performed a read coverage analysis
on all differently treated samples and control. For each
sample, we used the total number of reads as benchmark
and then randomly selected 5x, 10x, 20x, 35x, 50x and
100x coverage (e.g. 5x coverage for each sample containing
a pool of 150 mutants is 750 reads). Except for 100x
coverage, where the full reads were used once, for all other
coverage analyses (5x, 10x, 20x, 35x and 50x), we performed
the random selection of reads 2-4 times and used
the average results. Note that 'coverage' here refers to the
average number of reads sequenced per sample. For each
coverage analysis, we calculated (i) the minimum true 
mutation, which is the percentage of mutations detected 
in the x randomly selected reads that can also be found in 
the full reads (lOOx) and (ii) the maximum false mutation, 
which is the number of mutations detected in the x
randomly selected reads that are not detectable in the
full reads. As shown in Figure 6, in all cases, 20x
coverage was sufficient to yield >97% minimum true mutation
and <2 maximum false mutations. From 20x
coverage onward, the minimum true mutation began to
approach 100%, whereas the maximum false mutation
started to reach a negligible level.
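The subsampling procedure can be sketched as follows. This is a simplified illustration under stated assumptions: each read is represented as a tuple of the mutation labels it carries, a mutation is called when it appears in at least a given fraction of reads (the 0.0066 default mirrors the 0.66% threshold mentioned in the Discussion), and the function and parameter names are hypothetical rather than part of the authors' pipeline.

```python
import random
from collections import Counter

def call_mutations(reads, min_fraction=0.0066):
    """Return the set of mutation labels present in at least min_fraction of reads."""
    if not reads:
        return set()
    counts = Counter(m for read in reads for m in set(read))
    return {m for m, c in counts.items() if c / len(reads) >= min_fraction}

def coverage_metrics(reads, fold, pool_size=150, repeats=3, seed=0):
    """Subsample fold * pool_size reads several times.

    Returns (mean percentage of subsample mutations also found in the full
    reads, mean number of subsample mutations absent from the full reads).
    """
    rng = random.Random(seed)
    full = call_mutations(reads)
    n = min(fold * pool_size, len(reads))
    true_pct, false_count = [], []
    for _ in range(repeats):
        sub = call_mutations(rng.sample(reads, n))
        true_pct.append(100.0 * len(sub & full) / len(sub) if sub else 0.0)
        false_count.append(len(sub - full))
    return sum(true_pct) / repeats, sum(false_count) / repeats

# Hypothetical toy sample: 150 reads, two recurrent mutations, 60 wild-type reads.
toy_reads = [("G100A",)] * 60 + [("C200T",)] * 30 + [()] * 60
metrics_full = coverage_metrics(toy_reads, fold=1, pool_size=150)
reads_at_5x = 5 * 150  # 5x coverage of a 150-mutant pool, as in the text
```

When the subsample equals the full read set, the true mutation percentage is 100 and the false mutation count is 0 by construction; lower folds on real data would show how quickly the two metrics converge.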

To specifically address the effect of depth of coverage on
the detection of mutation, we then grouped the true
mutation ratios and false mutation counts into four
categories, including (i) 5x coverage, (ii) 10x coverage,
(iii) 20x coverage and (iv) 35x and higher coverage. The
analysis of variance and Robust Test of Equality of
Means results revealed that the depth of coverage (up
until 20x) has a significant effect on the true mutation
ratio, F(3, 20.53) = 20.27 (P < 0.0001) (Supplementary
Table S2). Post hoc comparisons using the Games-Howell
test showed that the mean of true mutation ratio at 10x
coverage (M = 0.9429, SD = 0.041) was not significantly
different from that at 5x coverage (M = 0.9269,
SD = 0.054) (P = 0.797), whereas this value differed significantly
from that at 20x coverage (M = 0.9845, SD = 0.017)
(P = 0.014). However, the mean of true mutation ratio at
20x coverage was not significantly different from that at
35x and higher coverage (M = 0.9999, SD = 0.0003).
Together, these data indicate that the true mutation ratio
is improved with increasing depth of coverage up until 20x,
after which there is no significant improvement. Likewise,
similar analysis confirmed that the depth of coverage (up
until 20x) has a significant effect on the false mutation
count, F(3, 28.34) = 26.55 (P < 0.0001); the false
mutation count continues to reduce significantly with
increasing depth of coverage up until 20x, after which
there is no significant reduction (Supplementary Table
S2). Altogether, these data indicate that our next-generation
sequencing-based method has more than sufficient read
coverage (~5 times higher than required) to detect the experimentally
induced and control mutations with high sensitivity
and specificity.
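The fractional denominator degrees of freedom reported above (e.g. 20.53) are what a Welch-type test of equal means produces when group variances differ. The sketch below computes Welch's F statistic and its degrees of freedom; that the Robust Test of Equality of Means used here corresponds to Welch's test is an assumption, and the function name and example groups are hypothetical.

```python
def welch_anova(groups):
    """Welch's one-way test for equal means without assuming equal variances.

    Returns (F statistic, numerator df, denominator df).
    Each group needs at least two values and a non-zero variance.
    """
    k = len(groups)
    n = [len(g) for g in groups]
    mean = [sum(g) / len(g) for g in groups]
    var = [
        sum((x - m) ** 2 for x in g) / (len(g) - 1)
        for g, m in zip(groups, mean)
    ]
    w = [ni / vi for ni, vi in zip(n, var)]          # precision weights
    w_total = sum(w)
    grand = sum(wi * mi for wi, mi in zip(w, mean)) / w_total
    between = sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, mean)) / (k - 1)
    lam = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2

# Hypothetical example with three small groups (not data from this study).
f_stat, df1, df2 = welch_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

Four coverage categories give a numerator df of 3, matching the first value in F(3, 20.53); the P-value would then be read from an F distribution with these two degrees of freedom.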



DISCUSSION 

In this study, we have developed a high-throughput 
method for detecting the mutational fingerprint of car- 
cinogens by devising a cost-, time- and labor-effective 
strategy in which a widely used transgenic mutagenesis 
assay is made compatible with a next-generation sequen- 
cing platform. Accordingly, we have modified the Big 
Blue® mouse mutation detection assay and incorporated 
it into the Roche/454 Genome Sequencer FLX Titanium 
next-generation sequencing technology. In addition, we 
have set up a detailed bioinformatics approach to 
process and analyze the high volume sequencing data. 
We have used this novel method to detect the mutational 
fingerprints of three prominent environmental carcinogens 
with varying mutagenic potencies, including sunlight 
UVB, 4-ABP and SHS that are known to be strong, mod- 
erate and weak mutagens, respectively (11-13). Here, we 
demonstrate that our new method can detect the
mutational fingerprints of these three carcinogens with
high sensitivity and specificity. Furthermore, we verify 
that the accuracy and reproducibility of this method are 
comparable with those of the currently available low- 
throughput method. 

Using this new method, we have successfully established
the mutational fingerprints of sunlight UVB, 4-ABP and
SHS by detecting three distinct mutation spectra in the
cII gene in the genomes of Big Blue® mice/cells treated
with the respective carcinogens. The mutational fingerprint
of sunlight UVB was characterized by the preponderance
of dipyrimidine-targeted mutations, which were
predominantly G:C→A:T transitions, and clustered
at several codon positions in the cII gene in the genome
of UVB-irradiated cells (Figure 1B and C, Supplementary
Figure S2 and Supplementary Table S1). The
4-ABP-induced mutational fingerprint manifested as the
prevailing G:C basepair-localized mutations, which were
mostly G:C→T:A transversions, and occurred frequently
at several codons in the cII gene in the genome of
4-ABP-treated mice (Figure 2B and C, Supplementary
Figure S3 and Supplementary Table S1). In the case of
SHS, a subtle yet distinguishable mutational fingerprint
was established, as the induced cII mutations, mostly being
G:C→A:T transitions, were localized to G:C basepairs
in the genome of SHS-treated mice (Figure 3B and C,
Supplementary Figure S4 and Supplementary Table S1).
The above-specified mutational fingerprints of these three 
carcinogens are comparable with those found previously 
in the same model system using the conventional low- 
throughput method (11-13). 
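The base substitution classes used to describe these fingerprints can be derived mechanically from a reference base and a mutant base. The sketch below is illustrative only: it assumes the common convention of reporting each substitution from the purine of the reference base pair (G:C or A:T), and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PURINES = {"A", "G"}

def classify_substitution(ref, alt):
    """Return (label, kind), e.g. ('G:C→A:T', 'transition')."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Collapse both strands: always report from the purine of the reference pair.
    if ref not in PURINES:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    label = f"{ref}:{COMPLEMENT[ref]}→{alt}:{COMPLEMENT[alt]}"
    # After collapsing, ref is a purine, so purine-to-purine is a transition.
    kind = "transition" if alt in PURINES else "transversion"
    return label, kind

def mutation_spectrum(substitutions):
    """Count substitution classes over (ref, alt) pairs."""
    return Counter(classify_substitution(r, a)[0] for r, a in substitutions)

# Hypothetical calls: C>T and G>A fall in the same class; G>T is a transversion.
spectrum = mutation_spectrum([("C", "T"), ("G", "A"), ("G", "T")])
```

Collapsing complementary changes into six classes is what allows spectra from different samples to be compared class by class.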

The tested carcinogens in this study are known 
to induce predominantly base substitutions, a type of 
mutation that is effectively detectable by the Big Blue® 
mutation assay (10,25). The successful application of 
our new method for the detection of the mutational
fingerprints of the tested carcinogens indicates that this method is
suitable for establishing the mutational signature of a wide
range of carcinogens. In addition, the method is flexible
enough to be coupled with other transgenic or non-transgenic
mutation detection assays if the modifications described
here are implemented accordingly. For instance, the gpt
delta transgenic mutation assay, which is optimized for the 
detection of small/large deletion mutations and point mu- 
tations (26), can easily be incorporated into this method to 
allow establishing the mutational fingerprint of other 
classes of carcinogens, for example, clastogens. Likewise, 
the hypoxanthine-guanine phosphoribosyltransferase 
(hprt) mutation assay (27) can be introduced into this 
new method to offer high-throughput detection of muta- 
tional fingerprint of carcinogens in an endogenous 
reporter gene of the human genome. 

Figure 6. Read coverage analysis for cII mutations in carcinogen-treated mice/cells versus control. For each sample, total number of reads was used
as benchmark and subsequently, randomly selected 5x, 10x, 20x, 35x, 50x and 100x coverage analyses were performed as described in the text.

Our overall findings show that the new method is superior
to the conventional method for establishing the mutational
fingerprint of carcinogens. Most importantly, the new
method offers great advantages over the traditional
method as it saves significant amounts of time, labor and
cost. For example, the conventional method requires preparation,
processing and analysis of a large number of
mutants (individually), whereas the new method achieves
this same objective by a single analysis of a pool of
mutants (simultaneously). We have calculated the amounts
of time and expenses that we spent on the analysis of all our 
tested samples using both the conventional and the new 
methods. According to our calculations, the new method 
is approximately 20 times less time consuming and 3.5 
times less costly than the conventional method. If the 
reduced workload of personnel is factored into these calcu- 
lations, the savings will become even greater. To reduce the 
cost of sequencing, one can also use barcoding for multi- 
plexing, that is, an auxiliary technique in which 
sample-specific barcoding adapters that include unique 
sequence tags and a restriction site are attached to individual 
samples. After pooling the tagged DNA samples, library 
preparation and sequencing, the tag sequences are used 
to identify the generated sequences that correspond to 
each original sample (28,29). Currently, work in our labora- 
tory is underway to use 12 different barcoded adapters in 
a single assay run, which will enable us to pool 12 independ- 
ent samples together and analyze them simultaneously 
in each of the 8 lanes of a Roche/454 Genome Sequencer. 
Prospectively, the incorporation of the barcoding approach 
into our new next-generation sequencing-based method
will save greater amounts of time, labor and cost in future
sequencing projects. 
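The tag-based assignment of pooled reads back to their samples can be sketched as below. This is a minimal illustration, assuming that each read begins with its sample tag, that tags are matched exactly and that no tag is a prefix of another; the tag sequences and sample names are hypothetical.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Group reads by their leading barcode and strip the barcode from each read.

    barcodes maps tag sequence -> sample name. Reads with no matching tag
    are collected under the key 'unassigned'.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for tag, sample in barcodes.items():
            if read.startswith(tag):
                by_sample[sample].append(read[len(tag):])
                break
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)

# Hypothetical tags and reads.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTGGGCCC", "TGCAAAATTT", "ACGTTTTAAA", "GGGGCCCCAA"]
assigned = demultiplex(reads, barcodes)

# 12 barcoded samples in each of 8 lanes, as planned in the text.
samples_per_run = 12 * 8
```

Real 454 reads contain sequencing errors, so a production demultiplexer would usually tolerate a mismatch in the tag; exact matching is used here only to keep the sketch short.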

As the next-generation sequencing technologies are con- 
stantly evolving and rapidly undergoing refinements, the 
cost of such analysis is expected to drop significantly 
(6-8). Due to financial constraints and lack of bioinfor- 
matics support, small laboratories may not be able to 
perform on-site next-generation sequencing work. 
Currently, however, many universities, research institutes 
and private companies have core facilities, which provide 
competitive next-generation sequencing services and bio- 
informatics data analysis to outside investigators. The 
accuracy and reproducibility of our new method, which
are consistent with the known low error rate of the Roche/
454 platform (30), and its sensitivity and specificity, which
are comparable with those of the existing method, together with the
above-mentioned prospects, all indicate the potential
of the new method to become the mainstay of
mutational fingerprinting for carcinogens.

Recently, Gilles et al. (31) have shown that the total 
error rates of the 454 GS-FLX Titanium instrument 
for the first 101 bases and for the full-length sequence 
(average: 550 bases) are 0.49% and 1.07%, respectively.
The majority of these errors could be ascribed to false
insertions and deletions. Of note, insertions and deletions
were the minor types of mutation detected in this study,
whereas base substitutions comprised the predominant
type of the detected mutations. Gilles et al. (31) have also
demonstrated that high coverage can help to correct false
mutations caused by random errors, and 5x coverage was
determined as the minimum coverage needed to achieve 
this goal. In our study, the sequence reads were in the 
range of 100-400 bases after being trimmed for the 
primer sequences at the beginning and end of the reads. 
Given the average length of the reads for our target 
sequence, high coverage and minor occurrence of inser- 
tions/deletions relative to base substitutions, we feel confi- 
dent that the 0.66% minimum threshold used for 
the detection of mutations in this study is sufficient to 
distinguish between true mutations and sequencing 
errors generated by the GS-FLX Titanium analysis. The 
reproducibility of the results obtained by our new
next-generation sequencing-based method, as well as their
comparability with those obtained by the conventional
method, supports the sensitivity and specificity of the new
method for detecting the mutational fingerprint of carcinogens.
Of note, we have also used a minimum threshold of
0.33% and obtained similar results to those found using the 
0.66% minimum threshold (data not shown). Altogether, 
the read-out of interest in this study is the mutagen signa- 
ture, which is not substantially affected by the minimum 
threshold criteria. We stress that the comparable muta- 
tional signatures of all the tested carcinogens established 
by our new next-generation sequencing-based method and 
the conventional DNA sequencing confirm the adequacy of 
the minimum threshold criteria used for the detection of 
mutations in this study. 

We would like to acknowledge that, for comparability 
purposes, we have analyzed a pool of 150 mutants per 
sample by our new method, which is consistent with the 
conventional DNA sequencing approach, in which a similar
number of mutants is sequenced individually for establishing
the mutational fingerprint of carcinogens. We note
that in our preliminary studies, we have used pools of 50 
and 150 mutants, respectively, per sample, and analyzed 
them by our next-generation sequencing-based method, 
which yielded similar results in both cases. This observa- 
tion together with the finding that our next-generation 
sequencing-based method has more than sufficient read 
coverage (~5 times higher than required; Figure 6), and 
the fact that obtaining a larger number of mutants, especially
in the case of weak mutagens, may not necessarily
prove practical, indicates that the pool of 150 mutants
analyzed in this study is large enough for establishing
the induced and control mutation spectra. We note that
given the numerous mutable nucleotide positions in the cII
gene, sequencing different numbers of mutants may reveal
slightly different mutations detectable at various nucleotide
positions in this gene; however, our data indicate that
the overall spectrum of mutations in the full-length cII
gene remains the same as long as an average of 150
pooled mutants is used for DNA sequencing.



Thus far, few studies have used next-generation
sequencing technologies for the detection of mutations in
foreign DNA (in a cell-free environment), for example, in a
shuttle vector, an RNA template or yeast (32-34). These
elegant studies have confirmed the applicability of 
next-generation sequencing platforms for mutagenicity 
analysis (32-34). However, the mutation detection systems 
employed in these studies may not necessarily represent 
some of the key determinants of mutagenesis in mammalian 
cells, for example, chromatin structure, DNA sequence 
contexts, fidelity and efficiency of DNA polymerases and 
DNA repair (4,35-38). In addition, these systems are not 
suitable for investigating organ-specific mutagenicity in 
relation to tumorigenesis, which is a unique property of 
certain carcinogens (12,13). The latter is reflective of the 
need for studying target-organ mutagenesis in animal 
models of tumorigenicity. The current literature lacks a 
comprehensive study, in which the application of next- 
generation sequencing technologies for the detection of mu- 
tations in chromosomal genes of a mammalian system 
is explored. Our study is the first demonstration of the 
applicability of these technologies for the detection of 
mutational fingerprint of carcinogens in a chromosomal 
gene in a validated mammalian model system (4,10). 
In spite of the increasingly popular use of the transgenic
model system for mutational analysis of carcinogens, the
system remains low-throughput and cost-, time- and
labor-ineffective (10). This study offers a new strategy to modify
the mutation detection assay in this model system by making 
it compatible with a next-generation sequencing platform, 
thus, allowing high-throughput analysis of mutational fin- 
gerprint of carcinogens in a cost-, time- and labor-effective 
manner. 

In summary, we have developed a new next-generation
sequencing-based method that can detect the mutational fingerprint of
carcinogens with high sensitivity and specificity. In 
addition, we have shown that the accuracy and reprodu- 
cibility of this new method are comparable with those of 
the currently available low-throughput method. Given the 
accuracy and reproducibility, great expediency and speed, 
and labor, time and cost effectiveness of this method, the 
method is poised to be employed in large-scale screening 
projects for detecting mutagenic carcinogens and become 
the method of choice for high-throughput DNA- 
sequencing analysis. Prospectively, the method will have 
the potential to move the field of carcinogenesis forward 
by allowing high-throughput analysis of mutations 
induced by endogenous and/or exogenous genotoxins. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1 and 2 and Supplementary 
Figures 1-5. 